Abstract: 

In this paper, we propose the first exact algorithm for minimizing the difference
of two submodular functions (D.S.), i.e., the discrete version of the D.C.
programming problem. The developed algorithm is a branch-and-bound-based
algorithm that exploits the structure of this problem through the relationship
between submodularity and convexity. The D.S. programming problem covers a broad
range of applications in machine learning because it generalizes the optimization
of a wide class of set functions. We empirically investigate the performance of
our algorithm, and illustrate the difference between the exact solutions obtained
by the proposed algorithm and the approximate solutions obtained by existing
algorithms in feature selection and discriminative structure learning.



1. Introduction 

Combinatorial optimization techniques have been actively applied to many machine
learning applications, where submodularity often plays an important role in
developing algorithms. In fact, many fundamental problems in machine learning can
be formulated as submodular optimization. One important category is the D.S.
programming problem, i.e., the problem of minimizing the difference of two
submodular functions. This is a natural formulation of many machine learning
problems, such as learning graph matching [3], discriminative structure learning
[21], feature selection [1] and energy minimization.

In this paper, we propose a prismatic algorithm for the D.S. programming problem,
which is a branch-and-bound-based algorithm that exploits the specific structure
of this problem. To the best of our knowledge, this is the first exact algorithm
for the D.S. programming problem (although there exists an approximate algorithm
for this problem [21]). As is well known, the branch-and-bound method is one of
the most successful frameworks in mathematical programming and has been
incorporated into commercial software such as CPLEX [13, 12]. We develop the
algorithm based on the analogy with the D.C. programming problem through the
continuous relaxation of solution spaces and objective functions with the help of
the Lovász extension [11, 17, 18]. The algorithm is implemented as an iterative
calculation of binary-integer linear programming (BILP).

Also, we discuss applications of the D.S. programming problem in machine learning
and investigate empirically the performance of our method and the difference
between exact and approximate solutions through feature selection and
discriminative structure-learning problems.

The remainder of the paper is organized as follows. In Section 2, we give the
formulation of the D.S. programming problem and then describe its applications in
machine learning. In Section 3, we give an outline of the proposed algorithm for
this problem. Then, in Section 4, we explain the details of its basic operations.
Finally, we give several empirical examples using artificial and real-world
datasets in Section 5 and conclude the paper in Section 6.



Preliminaries and Notation: A set function $f$ is called submodular if
$f(A) + f(B) \ge f(A \cup B) + f(A \cap B)$ for all $A, B \subseteq N$, where
$N = \{1, \dots, n\}$ [5]. Throughout this paper, we denote by $\hat{f}$ the
Lovász extension of $f$, i.e., a continuous function
$\hat{f} : \mathbb{R}^n \to \mathbb{R}$ defined by

$$\hat{f}(p) = \sum_{j=1}^{m-1} (\hat{p}_j - \hat{p}_{j+1}) f(U_j) + \hat{p}_m f(U_m),$$

where $U_j = \{i \in N : p_i \ge \hat{p}_j\}$ and
$\hat{p}_1 > \cdots > \hat{p}_m$ are the $m$ distinct elements of $p$ [17, 18].
Also, we denote by $I_A \in \{0,1\}^n$ the characteristic vector of a subset
$A \subseteq N$, i.e., $I_A = \sum_{i \in A} e_i$, where $e_i$ is the $i$-th unit
vector. Note that, through the definition of the characteristic vector, any
subset $A \subseteq N$ is in one-to-one correspondence with a vertex of the
$n$-dimensional cube
$D := \{x \in \mathbb{R}^n : 0 \le x_i \le 1\ (i = 1, \dots, n)\}$. And, we
denote by $(A, t)(T)$ all combinations of a real value plus subset whose
corresponding vectors $(I_A, t)$ are inside or on the surface of a polytope
$T \subseteq \mathbb{R}^{n+1}$.
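
As an illustration of the definition above, the following minimal Python sketch
evaluates the Lovász extension of a set function at a point $p$. The function
name `lovasz_extension`, the 0-based element indexing and the assumption
$f(\emptyset) = 0$ are choices made for this example only; they are not part of
the original paper.

```python
from typing import Callable, FrozenSet, Sequence


def lovasz_extension(f: Callable[[FrozenSet[int]], float],
                     p: Sequence[float]) -> float:
    """Evaluate the Lovasz extension of the set function f at the point p.

    Implements f_hat(p) = sum_{j<m} (q_j - q_{j+1}) f(U_j) + q_m f(U_m),
    where q_1 > ... > q_m are the distinct values of p and
    U_j = {i : p_i >= q_j}. Elements are indexed 0..n-1 (an assumption
    of this sketch).
    """
    levels = sorted(set(p), reverse=True)  # q_1 > ... > q_m
    value = 0.0
    for j, q in enumerate(levels):
        upper_set = frozenset(i for i, p_i in enumerate(p) if p_i >= q)
        # For the last level the weight is q_m itself (q_{m+1} := 0).
        q_next = levels[j + 1] if j + 1 < len(levels) else 0.0
        value += (q - q_next) * f(upper_set)
    return value


if __name__ == "__main__":
    # A concave function of the cardinality is submodular.
    f = lambda A: len(A) ** 0.5
    print(lovasz_extension(f, [1.0, 0.0, 1.0]))  # value at a cube vertex
    print(lovasz_extension(f, [0.5, 0.2, 0.9]))  # value at an interior point
```

At a vertex $I_A$ of the cube the sketch returns $f(A)$, which is the property
that lets the continuous relaxation agree with the set function on subsets.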



2. The D.S. Programming Problem and its Applications 

Let $f$ and $g$ be submodular functions. In this paper, we address an exact
algorithm to solve the D.S. programming problem, i.e., the problem of minimizing
the difference of two submodular functions:

(1)  $$\min_{A \subseteq N} f(A) - g(A).$$

As is well known, any real-valued function whose second partial derivatives are
continuous everywhere can be represented as the difference of two convex
functions [12]. Similarly, Problem (1) generalizes a wide class of set-function
optimization problems. Problem (1) covers a broad range of applications in
machine learning [21, 24, 1]. Here, we give a few examples.
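
For very small ground sets, Problem (1) can be solved by plain enumeration,
which gives a reference value to compare any solver against. The sketch below
is only such a baseline and not the prismatic algorithm of this paper; the
helper names `is_submodular` and `brute_force_ds` are hypothetical, and the
running time is exponential in $n$.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterator, Tuple

SetFn = Callable[[FrozenSet[int]], float]


def all_subsets(n: int) -> Iterator[FrozenSet[int]]:
    """Yield every subset of {0, ..., n-1}."""
    for size in range(n + 1):
        for combo in combinations(range(n), size):
            yield frozenset(combo)


def is_submodular(f: SetFn, n: int, tol: float = 1e-12) -> bool:
    """Check f(A) + f(B) >= f(A | B) + f(A & B) for all pairs A, B."""
    subsets = list(all_subsets(n))
    return all(f(a) + f(b) >= f(a | b) + f(a & b) - tol
               for a in subsets for b in subsets)


def brute_force_ds(f: SetFn, g: SetFn, n: int) -> Tuple[float, FrozenSet[int]]:
    """Return (min value, minimizer) of f(A) - g(A) over all subsets A."""
    return min(((f(a) - g(a), a) for a in all_subsets(n)),
               key=lambda pair: pair[0])


if __name__ == "__main__":
    f = lambda A: len(A) ** 0.5          # submodular
    g = lambda A: float(min(len(A), 2))  # submodular
    print(is_submodular(f, 3), is_submodular(g, 3))
    print(brute_force_ds(f, g, 3))
```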



Feature selection using structured-sparsity-inducing norms. Sparse methods for
supervised learning, where we aim at finding good predictors from as few
variables as possible, have attracted interest from the machine learning
community. This combinatorial problem is known to be a submodular maximization
problem with a cardinality constraint for commonly used measures such as
least-squared errors [14]. And as is well known, if we replace the cardinality
function with its convex envelope, such as the $\ell_1$-norm, this can be turned
into a convex optimization problem. Recently, it has been reported that using
submodular functions in place of the cardinality can give a wider family of
polyhedral norms and may incorporate prior knowledge or structural constraints
in sparse methods. Then, the objective (that is supposed to be minimized)
becomes the sum of a loss function (often supermodular) and submodular
regularization terms.



Figure 1. Illustration of the prismatic algorithm for the D.S. programming
problem.

Discriminative structure learning. It has been reported that discriminatively
structured Bayesian classifiers often outperform generatively structured ones
[22]. One commonly used metric for discriminative structure learning is EAR
(explaining away residual) [2]. EAR is defined as the difference between the
conditional mutual information between variables given the class $C$ and the
unconditional one, i.e., $I(X_i; X_j \mid C) - I(X_i; X_j)$. In structure
learning, we repeatedly try to find a subset of variables that minimizes this
kind of measure. Since the (symmetric) mutual information is a submodular
function, this problem leads to the D.S. programming problem [21].

Energy minimization in computer vision. In computer vision, an image is often
modeled with a Markov random field, where each node represents a pixel. Let
$\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be the undirected graph, where a
label $x_s \in \mathcal{L}$ is assigned on each node. Then, many tasks in
computer vision can be naturally formulated in terms of energy minimization
where the energy function has the form
$E(x) = \sum_{p \in \mathcal{V}} \theta_p(x_p) + \sum_{(p,q) \in \mathcal{E}} \theta_{pq}(x_p, x_q)$,
where $\theta_p(i)$ and $\theta_{pq}(i, j)$ are univariate and pairwise
potentials. For a pairwise potential, submodularity is defined as
$\theta_{pq}(x_p, x_q) + \theta_{pq}(x'_p, x'_q) \ge \theta_{pq}((x_p, x_q) \wedge (x'_p, x'_q)) + \theta_{pq}((x_p, x_q) \vee (x'_p, x'_q))$
(see, for example, [26]). Based on this, many energy functions in computer
vision can be written with a submodular function $E_1(x)$ and a supermodular
function $E_2(x)$ as $E(x) = E_1(x) + E_2(x)$ (e.g., [51]). Or, in the case of
binarized energy (i.e., $\mathcal{L} = \{0, 1\}$), even if such an explicit
decomposition is not known, a non-unique decomposition into submodular and
supermodular functions can always be given [25].
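
The pairwise submodularity condition above can be checked directly on a small
label set. The following sketch assumes that labels are totally ordered numbers,
so that the meet and join are the componentwise minimum and maximum, and that a
potential is stored as a dictionary keyed by label pairs; both are assumptions
made for this illustration.

```python
from itertools import product
from typing import Dict, Sequence, Tuple


def is_submodular_pairwise(theta: Dict[Tuple[int, int], float],
                           labels: Sequence[int],
                           tol: float = 1e-12) -> bool:
    """Check theta(u) + theta(v) >= theta(u ^ v) + theta(u v v) for all u, v.

    u and v range over label pairs; the meet/join are taken componentwise
    as min/max (assumes ordered labels).
    """
    pairs = list(product(labels, repeat=2))
    for (a, b), (c, d) in product(pairs, repeat=2):
        meet = (min(a, c), min(b, d))
        join = (max(a, c), max(b, d))
        if theta[(a, b)] + theta[(c, d)] < theta[meet] + theta[join] - tol:
            return False
    return True


if __name__ == "__main__":
    labels = [0, 1]
    # Potts-type potential: penalize different labels on neighboring pixels.
    potts = {(a, b): float(a != b) for a in labels for b in labels}
    # Reversed potential: penalize equal labels.
    anti = {(a, b): float(a == b) for a in labels for b in labels}
    print(is_submodular_pairwise(potts, labels))
    print(is_submodular_pairwise(anti, labels))
```

For binary labels the check reduces to
$\theta_{pq}(0,1) + \theta_{pq}(1,0) \ge \theta_{pq}(0,0) + \theta_{pq}(1,1)$.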

3. Prismatic Algorithm for the D.S. Programming Problem 

By introducing an additional variable $t\ (\in \mathbb{R})$, Problem (1) can be
converted into the equivalent problem with a supermodular objective function and
a submodular feasible set, i.e.,

(2)  $$\min_{A \subseteq N,\ t \in \mathbb{R}} t - g(A) \quad \text{s.t.} \quad f(A) - t \le 0.$$

Obviously, if $(A^*, t^*)$ is an optimal solution of Problem (2), then $A^*$ is
an optimal solution of Problem (1) and $t^* = f(A^*)$. The proposed algorithm is
a realization of the branch-and-bound scheme that exploits this specific
structure of the problem.



To this end, we first define a prism $T = T(S) \subset \mathbb{R}^{n+1}$ by

$$T = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : x \in S\},$$

where $S$ is an $n$-simplex. $S$ is obtained from the $n$-dimensional cube $D$
at the initial iteration (as described in Section 4.1), or by the subdivision
operation described in the later part of this section (the details are given in
Section 4.2). The prism $T$ has $n + 1$ edges that are vertical lines (i.e.,
lines parallel to the $t$-axis) which pass through the $n + 1$ vertices of $S$,
respectively [11].

Our algorithm is an iterative procedure which mainly consists of two parts,
branching and bounding, as in other branch-and-bound frameworks [13]. In
branching, subproblems are constructed by dividing the feasible region of a
parent problem. In bounding, we judge whether an optimal solution can exist in
the region of a subproblem and its descendants by calculating a lower bound of
the subproblem and comparing it with an upper bound of the original problem
(the best objective value found so far). Branching and bounding are described
in more detail as follows.

Branching. The branching operation in our method is carried out using the
property of a simplex. That is, since, in an $n$-simplex, any $r + 1$ vertices
are not on an $(r - 1)$-dimensional hyperplane for $r \le n$, any $n$-simplex
can be divided as $S = \bigcup_{i=1}^{p} S_i$, where $p \ge 2$ and $S_i$ are
$n$-simplices such that each pair of simplices $S_i, S_j\ (i \ne j)$ intersects
at most in common boundary points (the way of constructing such a partition is
explained in Section 4.2). Then, $T = \bigcup_{i=1}^{p} T_i$, where
$T_i = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : x \in S_i\}$, is a natural
prismatic partition of $T$ induced by the above simplicial partition.

Bounding. For the bounding operation on $S$ (resp., $T$), we consider a
polyhedral convex set $P$ such that $P \supset \tilde{D}$, where
$\tilde{D} = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : x \in D,\ \hat{f}(x) \le t\}$
is the region corresponding to the feasible set of Problem (2). At the first
iteration, such $P$ is obtained as

$$P_0 = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : x \in S,\ t \ge \tilde{t}\},$$

where $\tilde{t}$ is a real number satisfying
$\tilde{t} \le \min\{f(A) : A \subseteq N\}$. Here, $\tilde{t}$ can be
determined by using some existing submodular minimization solver [23]. At later
iterations, a more refined $P$, such that
$P_0 \supset P_1 \supset \cdots \supset \tilde{D}$, is constructed as described
in Section 4.4.

As described in Section 4.3, a lower bound $\beta(T)$ of $t - g(A)$ on the
current prism $T$ can be calculated through binary-integer linear programming
(BILP) (or linear programming (LP)) using $P$, obtained as described above. Let
$\alpha$ be the lowest function value (i.e., an upper bound of $t - g(A)$ on
$\tilde{D}$) found so far. Then, if $\beta(T) \ge \alpha$, we can conclude that
there is no feasible solution in $T$ which gives a function value better than
$\alpha$, and we can remove $T$ without loss of optimality.

The pseudo-code of the proposed algorithm is described in Algorithm 1. In the
following section, we explain the details of the operations involved in this
algorithm.

4. Basic Operations 

The procedure described in Section 3 involves the following basic operations:

(1) Construction of the first prism: A prism needs to be constructed from a
    hypercube at first,

(2) Subdivision process: A prism is divided into a finite number of sub-prisms
    at each iteration,

(3) Bound estimation: For each prism generated throughout the algorithm, a
    lower bound for the objective function $t - g(A)$ over the part of the
    feasible set contained in this prism is computed,

(4) Construction of cutting planes: Throughout the algorithm, a sequence of
    polyhedral convex sets $P_0, P_1, \dots$ is constructed such that
    $P_0 \supset P_1 \supset \cdots \supset \tilde{D}$. Each set $P_j$ is
    generated by a cutting plane to cut off a part of $P_{j-1}$, and

(5) Deletion of infeasible prisms: At each iteration, we try to delete prisms
    that contain no feasible solution better than the one obtained so far.

Algorithm 1: Pseudo-code of the prismatic algorithm for the D.S. programming
problem.

 1  Construct a simplex $S_0 \supset D$, its corresponding prism $T_0$ and a
    polyhedral convex set $P_0 \supset \tilde{D}$.
 2  Let $\alpha_0$ be the best objective function value known in advance. Then,
    solve the BILP (5) corresponding to $\alpha_0$ and $T_0$, and let
    $\beta_0 = \beta(T_0, P_0, \alpha_0)$ and $(\bar{A}^0, \bar{t}_0)$ be the
    point satisfying $\beta_0 = \bar{t}_0 - g(\bar{A}^0)$.
 3  Set $\mathcal{R}_0 \leftarrow \{T_0\}$.
 4  while $\mathcal{R}_k \ne \emptyset$
 5      Select a prism $T_k^* \in \mathcal{R}_k$ satisfying
        $\beta_k = \beta(T_k^*)$, $(\bar{v}^k, \bar{t}_k) \in T_k^*$.
 6      if $(\bar{v}^k, \bar{t}_k) \in \tilde{D}$ then
 7          Set $P_{k+1} = P_k$.
 8      else
 9          Construct $l_k(x, t)$ according to (8), and set
            $P_{k+1} = \{(x, t) \in P_k : l_k(x, t) \le 0\}$.
10      Subdivide $T_k^* = T(S_k^*)$ into a finite number of subprisms
        $T_{k,j}\ (j \in J_k)$ (cf. Section 4.2).
11      For each $j \in J_k$, solve the BILP (5) with respect to $T_{k,j}$,
        $P_{k+1}$ and $\alpha_k$.
12      Delete all $T_{k,j}\ (j \in J_k)$ satisfying (DR1) or (DR2). Let
        $\mathcal{R}'_k$ denote the collection of remaining prisms
        $T_{k,j}\ (j \in J_k)$, and for each $T \in \mathcal{R}'_k$ set
        $\beta(T) = \max\{\beta(T_k^*), \beta(T, P_{k+1}, \alpha_k)\}$.
13      Let $F_k$ be the set of new feasible points detected while solving
        BILP in Step 11, and set
        $\alpha_{k+1} = \min\{\alpha_k, \min\{t - g(A) : (A, t) \in F_k\}\}$.
14      Delete all $T \in \mathcal{R}_k$ satisfying $\beta(T) \ge \alpha_{k+1}$,
        and let $\bar{\mathcal{R}}_k$ denote the collection of remaining prisms.
15      Set $\mathcal{R}_{k+1} \leftarrow (\bar{\mathcal{R}}_k \setminus \{T_k^*\}) \cup \mathcal{R}'_k$
        and $\beta_{k+1} \leftarrow \min\{\beta(T) : T \in \mathcal{R}_{k+1}\}$.

4.1. Construction of the first prism. The initial simplex $S_0 \supset D$ (which
yields the initial prism $T_0 \supset \tilde{D}$) can be constructed as follows.
Let $v$ and $A_v$ be a vertex of $D$ and its corresponding subset in $N$,
respectively, i.e., $v = \sum_{i \in A_v} e_i$. Then, the initial simplex
$S_0 \supset D$ can be constructed by

$$S_0 = \{x \in \mathbb{R}^n : x_i \le 1\ (i \in A_v),\ x_i \ge 0\ (i \in N \setminus A_v),\ a^T x \le \gamma\},$$

where $a = \sum_{i \in N \setminus A_v} e_i - \sum_{i \in A_v} e_i$ and
$\gamma = |N \setminus A_v|$. The $n + 1$ vertices of $S_0$ are $v$ and the $n$
points where the hyperplane $\{x \in \mathbb{R}^n : a^T x = \gamma\}$ intersects
the edges of the cone
$\{x \in \mathbb{R}^n : x_i \le 1\ (i \in A_v),\ x_i \ge 0\ (i \in N \setminus A_v)\}$.
Note that this is just one option; any $n$-simplex $S \supset D$ can be used.
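
The construction of the initial simplex can be sketched as follows. The code
lists the $n + 1$ vertices explicitly: starting from $v$, each cone edge is
followed for a step of length $n$, which is where $a^T x$ rises from
$-|A_v|$ to $\gamma = n - |A_v|$. The function names and the 0-based indexing
are assumptions of this example.

```python
from typing import FrozenSet, List, Sequence


def initial_simplex(n: int, a_v: FrozenSet[int] = frozenset()) -> List[List[float]]:
    """Return the n+1 vertices of S_0 built at the cube vertex v = I_{A_v}."""
    v = [1.0 if i in a_v else 0.0 for i in range(n)]
    vertices = [v]
    for i in range(n):
        w = list(v)
        # Follow the cone edge in coordinate i until a^T x = gamma:
        # decrease x_i if i is in A_v, increase it otherwise.
        w[i] += -float(n) if i in a_v else float(n)
        vertices.append(w)
    return vertices


def in_initial_simplex(x: Sequence[float],
                       a_v: FrozenSet[int] = frozenset(),
                       tol: float = 1e-12) -> bool:
    """Test the three groups of inequalities that define S_0."""
    n = len(x)
    for i in range(n):
        if i in a_v and x[i] > 1.0 + tol:
            return False
        if i not in a_v and x[i] < -tol:
            return False
    a_dot_x = sum(-x[i] if i in a_v else x[i] for i in range(n))
    return a_dot_x <= (n - len(a_v)) + tol


if __name__ == "__main__":
    for vertex in initial_simplex(3, frozenset({1})):
        print(vertex)
```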

4.2. Sub-division of a prism. Let $S_k$ and $T_k$ be the simplex and prism at
the $k$-th iteration of the algorithm, respectively. We denote $S_k$ as
$S_k = [v_k^1, \dots, v_k^{n+1}] := \mathrm{conv}\{v_k^1, \dots, v_k^{n+1}\}$,
which is defined as the convex hull of its vertices
$v_k^1, \dots, v_k^{n+1}$. Then, any $r \in S_k$ can be represented as

$$r = \sum_{i=1}^{n+1} \lambda_i v_k^i, \quad \sum_{i=1}^{n+1} \lambda_i = 1, \quad \lambda_i \ge 0\ (i = 1, \dots, n + 1).$$

Suppose that $r \ne v_k^i\ (i = 1, \dots, n + 1)$. For each $i$ satisfying
$\lambda_i > 0$, let $S_k^i$ be the subsimplex of $S_k$ defined by

(3)  $$S_k^i = [v_k^1, \dots, v_k^{i-1}, r, v_k^{i+1}, \dots, v_k^{n+1}].$$

Then, the collection $\{S_k^i : \lambda_i > 0\}$ defines a partition of $S_k$,
i.e., we have [12]

$$\bigcup_{\lambda_i > 0} S_k^i = S_k, \quad \mathrm{int}\, S_k^i \cap \mathrm{int}\, S_k^j = \emptyset \ \text{for } i \ne j.$$

In a natural way, the prisms $T(S_k^i)$ generated by the simplices $S_k^i$
defined in Eq. (3) form a partition of $T_k$. This subdivision process of prisms
is exhaustive, i.e., for every nested (decreasing) sequence of prisms $\{T_q\}$
generated by this process, we have $\bigcap_{q=0}^{\infty} T_q = \tau$, where
$\tau$ is a line perpendicular to $\mathbb{R}^n$ (a vertical line) [11].
Although several subdivision processes can be applied, we use a classical
bisection, i.e., each simplex is divided into subsimplices by choosing $r$ in
Eq. (3) as

$$r = (v_k^{i_1} + v_k^{i_2})/2,$$

where
$\|v_k^{i_1} - v_k^{i_2}\| = \max\{\|v_k^i - v_k^j\| : i, j \in \{1, \dots, n + 1\},\ i \ne j\}$
(see Figure 1).
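
A minimal sketch of the bisection step is given below. Because $r$ is the
midpoint of the longest edge, only two coefficients $\lambda_i$ are positive,
so Eq. (3) yields exactly two subsimplices. The function name `bisect_simplex`
and the list-of-lists representation are assumptions of this example.

```python
from typing import List, Tuple

Simplex = List[List[float]]


def bisect_simplex(vertices: Simplex) -> Tuple[Simplex, Simplex]:
    """Split a simplex at the midpoint r of its longest edge.

    Each child is obtained by replacing one endpoint of that edge by r,
    as in Eq. (3).
    """
    m = len(vertices)

    def sqdist(u: List[float], w: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(u, w))

    # Indices (i1, i2) of the longest edge.
    i1, i2 = max(((i, j) for i in range(m) for j in range(i + 1, m)),
                 key=lambda ij: sqdist(vertices[ij[0]], vertices[ij[1]]))
    r = [(a + b) / 2.0 for a, b in zip(vertices[i1], vertices[i2])]
    child_a = [list(r) if idx == i1 else list(v) for idx, v in enumerate(vertices)]
    child_b = [list(r) if idx == i2 else list(v) for idx, v in enumerate(vertices)]
    return child_a, child_b


if __name__ == "__main__":
    triangle = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
    for child in bisect_simplex(triangle):
        print(child)
```

The subprisms are then $T(S)$ for each returned child $S$; no extra work is
needed in the $t$ direction.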

4.3. Lower bounds. Again, let $S_k$ and $T_k$ be the simplex and prism at the
$k$-th iteration of the algorithm, respectively. And, let $\alpha$ be an upper
bound of $t - g(A)$, which is the smallest value of $t - g(A)$ attained at a
feasible point known so far in the algorithm. Moreover, let $P_k$ be a
polyhedral convex set which contains $\tilde{D}$ and is represented as

(4)  $$P_k = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : A_k x + a_k t \le b_k\},$$

where $A_k$ is a real $(m \times n)$-matrix and $a_k, b_k \in \mathbb{R}^m$.
(Note that $P_k$ is updated at each iteration and does not depend on $S_k$, as
described in Section 4.4.) Now, a lower bound $\beta(T_k, P_k, \alpha)$ of
$t - g(A)$ over $T_k \cap \tilde{D}$ can be computed as follows. In this
section, we describe only the BILP implementation. The LP one and some
empirical comparison are discussed in the supplementary document.

First, let $v_k^i\ (i = 1, \dots, n + 1)$ denote the vertices of $S_k$, and
define $I(S_k) = \{i \in \{1, \dots, n + 1\} : v_k^i \in \mathbb{B}^n\}$ (with
$\mathbb{B} = \{0, 1\}$) and

$$\mu = \begin{cases} \min\{\alpha, \min\{\hat{f}(v_k^i) - \hat{g}(v_k^i) : i \in I(S_k)\}\}, & \text{if } I(S_k) \ne \emptyset, \\ \alpha, & \text{if } I(S_k) = \emptyset. \end{cases}$$

For each $i = 1, \dots, n + 1$, consider the point $(v_k^i, t_k^i)$ where the
edge of $T_k$ passing through $v_k^i$ intersects the level set
$\{(x, t) : t - \hat{g}(x) = \mu\}$, i.e.,

$$t_k^i = \hat{g}(v_k^i) + \mu \quad (i = 1, \dots, n + 1).$$



Then, let us denote the uniquely defined hyperplane through the points
$(v_k^i, t_k^i)$ by
$H = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : p^T x - t = \gamma\}$, where
$p \in \mathbb{R}^n$ and $\gamma \in \mathbb{R}$. Consider the upper and lower
halfspaces generated by $H$, i.e.,
$H_+ = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : p^T x - t \le \gamma\}$ and
$H_- = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : p^T x - t \ge \gamma\}$. If
$T_k \cap \tilde{D} \subset H_+$, then we see from the supermodularity of
$-g(A)$ (equivalently, the concavity of $-\hat{g}(x)$) that

$$\min\{t - g(A) : (A, t) \in (A, t)(T_k \cap \tilde{D})\} \ge \min\{t - g(A) : (A, t) \in (A, t)(T_k \cap H_+)\}$$
$$\ge \min\{t - \hat{g}(x) : (x, t) \in T_k \cap H_+\}$$
$$= \min\{t - \hat{g}(x) : (x, t) \in \{(v_k^1, t_k^1), \dots, (v_k^{n+1}, t_k^{n+1})\}\} = \mu.$$

Otherwise, we shift the hyperplane $H$ (downward with respect to $t$) until it
reaches a point $z = (x^*, t^*)$
($\in T_k \cap P \cap H_-$, $x^* \in \mathbb{B}^n$); here $(x^*, t^*)$ is a
point with the largest distance to $H$, and the corresponding pair $(A, t)$
(since $x^* \in \mathbb{B}^n$) is in $(A, t)(T_k \cap P \cap H_-)$. Let
$\bar{H}$ denote the resulting supporting hyperplane, and denote by $\bar{H}_+$
the upper halfspace generated by $\bar{H}$. Moreover, for each
$i = 1, \dots, n + 1$, let $z^i = (v_k^i, \bar{t}_k^i)$ be the point where the
edge of $T_k$ passing through $v_k^i$ intersects $\bar{H}$. Then, it follows
that
$(A, t)(T_k \cap \tilde{D}) \subset (A, t)(T_k \cap P) \subset (A, t)(T_k \cap \bar{H}_+)$,
and hence

$$\min\{t - g(A) : (A, t) \in (A, t)(T_k \cap \tilde{D})\} \ge \min\{t - g(A) : (A, t) \in (A, t)(T_k \cap \bar{H}_+)\}$$
$$= \min\{\bar{t}_k^i - \hat{g}(v_k^i) : i = 1, \dots, n + 1\}.$$

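Now, the above consideration leads to the following BILP in $(\lambda, x, t)$:

(5)  $$\max_{\lambda, x, t} \left( \sum_{i=1}^{n+1} t_k^i \lambda_i - t \right) \quad \text{s.t.} \quad Ax + at \le b, \quad x = \sum_{i=1}^{n+1} \lambda_i v_k^i, \quad x \in \mathbb{B}^n,$$
     $$\sum_{i=1}^{n+1} \lambda_i = 1, \quad \lambda_i \ge 0\ (i = 1, \dots, n + 1),$$

where $A$, $a$ and $b$ are given in Eq. (4).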

Proposition 1. (a) If the system (5) has no solution, then the intersection
$(A, t)(T_k \cap \tilde{D})$ is empty.

(b) Otherwise, let $(\lambda^*, x^*, t^*)$ be an optimal solution of BILP (5)
and $c^* = \sum_{i=1}^{n+1} t_k^i \lambda_i^* - t^*$ its optimal value,
respectively. Then, the following statements hold:

(b1) If $c^* \le 0$, then $(A, t)(T_k \cap \tilde{D}) \subset (A, t)(H_+)$.

(b2) If $c^* > 0$, then $z = (\sum_{i=1}^{n+1} \lambda_i^* v_k^i, t^*)$,
$z^i = (v_k^i, \bar{t}_k^i) = (v_k^i, t_k^i - c^*)$ and

$$\bar{t}_k^i - \hat{g}(v_k^i) = \mu - c^* \quad (i = 1, \dots, n + 1).$$

Proof. First, we prove part (a). Since every point in $S_k$ is uniquely
representable as $x = \sum_{i=1}^{n+1} \lambda_i v_k^i$, we see from Eq. (4)
that the set $(A, t)(T_k \cap P)$ coincides with the feasible set of problem
(5). Therefore, if the system (5) has no solution, then
$(A, t)(T_k \cap P) = \emptyset$, and hence
$(A, t)(T_k \cap \tilde{D}) = \emptyset$ (because $\tilde{D} \subset P$).

Next, we move to part (b). Since the equation of $H$ is $p^T x - t = \gamma$, it
follows that determining the hyperplane $\bar{H}$ and the point $z$ amounts to
solving the binary-integer linear programming problem:

(6)  $$\max\ p^T x - t \quad \text{s.t.} \quad (x, t) \in T_k \cap P, \quad x \in \mathbb{B}^n.$$

Here, we note that the objective of the above can be represented as

$$p^T x - t = p^T \left( \sum_{i=1}^{n+1} \lambda_i v_k^i \right) - t = \sum_{i=1}^{n+1} \lambda_i p^T v_k^i - t.$$

On the other hand, since $(v_k^i, t_k^i) \in H$, we have
$p^T v_k^i - t_k^i = \gamma\ (i = 1, \dots, n + 1)$, and hence

$$p^T x - t = \sum_{i=1}^{n+1} \lambda_i (\gamma + t_k^i) - t = \sum_{i=1}^{n+1} t_k^i \lambda_i - t + \gamma.$$

Thus, the two BILPs (5) and (6) are equivalent. And, if $\gamma^*$ denotes the
optimal objective function value in Eq. (6), then $\gamma^* = c^* + \gamma$. If
$\gamma^* \le \gamma$, then it follows from the definition of $H_+$ that
$\bar{H}$ is obtained by a parallel shift of $H$ in the direction $H_+$.
Therefore, $c^* \le 0$ implies $(A, t)(T_k \cap P_k) \subset (A, t)(H_+)$, and
hence $(A, t)(T_k \cap \tilde{D}) \subset (A, t)(H_+)$.

Since
$\bar{H} = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : p^T x - t = \gamma^*\}$
and $H = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : p^T x - t = \gamma\}$, we
see that for each intersection point $(v_k^i, \bar{t}_k^i)$ (and
$(v_k^i, t_k^i)$) of the edge of $T_k$ passing through $v_k^i$ with $\bar{H}$
(and $H$), we have $p^T v_k^i - \bar{t}_k^i = \gamma^*$ and
$p^T v_k^i - t_k^i = \gamma$, respectively. This implies that
$\bar{t}_k^i = t_k^i + \gamma - \gamma^* = t_k^i - c^*$, and (using
$t_k^i = \hat{g}(v_k^i) + \mu$) that
$\bar{t}_k^i = \hat{g}(v_k^i) + \mu - c^*$. ■

From the above, we see that, in the case (b1), $\mu$ constitutes a lower bound
of $t - g(A)$, whereas, in the case (b2), such a lower bound is given by
$\min\{\bar{t}_k^i - \hat{g}(v_k^i) : i = 1, \dots, n + 1\}$. Thus, Proposition
1 provides the lower bound

(7)  $$\beta_k(T_k, P_k, \alpha) = \begin{cases} +\infty, & \text{if BILP (5) has no feasible point}, \\ \mu, & \text{if } c^* \le 0, \\ \mu - c^*, & \text{if } c^* > 0. \end{cases}$$

As stated in Section 4.5, $T_k$ can be deleted from further consideration when
$\beta_k = \infty$ or $\mu$.
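
The case distinction of Eq. (7) and the resulting deletion test are simple
enough to state as code. In the sketch below, the BILP itself is not solved;
its optimal value `c_star` is taken as an input, with `None` standing for an
infeasible BILP. This encoding and both function names are assumptions made for
the illustration.

```python
import math
from typing import Optional


def lower_bound(mu: float, c_star: Optional[float]) -> float:
    """Lower bound of Eq. (7) from mu and the optimal value c* of BILP (5).

    c_star is None when BILP (5) has no feasible point.
    """
    if c_star is None:
        return math.inf
    return mu if c_star <= 0 else mu - c_star


def can_delete(c_star: Optional[float]) -> bool:
    """True when the prism can be discarded (bound equal to +inf or to mu)."""
    return c_star is None or c_star <= 0


if __name__ == "__main__":
    for c in (None, -0.5, 0.25):
        print(c, lower_bound(1.0, c), can_delete(c))
```

Only prisms with $c^* > 0$ survive; for them the bound $\mu - c^*$ lies
strictly below $\mu$, so they may still contain a point that improves on the
incumbent.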

4.4. Outer approximation. The polyhedral convex set $P \supset \tilde{D}$ used
in the preceding section is updated in each iteration, i.e., a sequence
$P_0, P_1, \dots$ is constructed such that
$P_0 \supset P_1 \supset \cdots \supset \tilde{D}$. The update from $P_k$ to
$P_{k+1}\ (k = 0, 1, \dots)$ is done in a way which is standard for pure outer
approximation methods [12]. That is, a certain linear inequality
$l_k(x, t) \le 0$ is added to the constraint set defining $P_k$, i.e., we set

$$P_{k+1} = P_k \cap \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : l_k(x, t) \le 0\}.$$

The function $l_k(x, t)$ is constructed as follows. At iteration $k$, we have a
lower bound $\beta_k$ of $t - g(A)$ as defined in Eq. (7) with $P = P_k$, and a
point $(\bar{v}^k, \bar{t}_k)$ satisfying
$\bar{t}_k - \hat{g}(\bar{v}^k) = \beta_k$. We update the outer approximation
only in the case $(\bar{v}^k, \bar{t}_k) \notin \tilde{D}$. Then, we can set

(8)  $$l_k(x, t) = s_k^T [(x, t) - z_k] + (\hat{f}(x_k^*) - t_k^*),$$

where $z_k = (x_k^*, t_k^*)$ and $s_k$ is a subgradient of $\hat{f}(x) - t$ at
$z_k$. The subgradient can be calculated as, for example, stated in [9] (see
also [7]).

Proposition 2. The hyperplane
$\{(x, t) \in \mathbb{R}^n \times \mathbb{R} : l_k(x, t) = 0\}$ strictly
separates $z_k$ from $\tilde{D}$, i.e., $l_k(z_k) > 0$, and $l_k(x, t) \le 0$
for all $(x, t) \in \tilde{D}$.

Proof. Since we assume that $z_k \notin \tilde{D}$, we have
$l_k(z_k) = \hat{f}(x_k^*) - t_k^* > 0$. And, the latter inequality is an
immediate consequence of the definition of a subgradient. ■
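
The cut of Eq. (8) can be sketched with the standard greedy rule for a
subgradient of the Lovász extension: sort the coordinates of $x_k^*$ in
decreasing order and take marginal gains of $f$ along that order. This is one
common way to obtain a subgradient and is used here as an assumption; the paper
refers to [9] and [7] for the computation. The sketch also assumes
$f(\emptyset) = 0$, and the names `greedy_subgradient` and `make_cut` are
hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence

SetFn = Callable[[FrozenSet[int]], float]


def greedy_subgradient(f: SetFn, x: Sequence[float]) -> List[float]:
    """Subgradient of the Lovasz extension of f at x (greedy ordering)."""
    order = sorted(range(len(x)), key=lambda i: -x[i])
    s = [0.0] * len(x)
    prefix: set = set()
    previous = f(frozenset())
    for i in order:
        prefix.add(i)
        current = f(frozenset(prefix))
        s[i] = current - previous  # marginal gain of adding element i
        previous = current
    return s


def make_cut(f: SetFn, x_star: Sequence[float], t_star: float):
    """Return l_k(x, t) of Eq. (8) for the point z_k = (x_star, t_star).

    The subgradient of f_hat(x) - t with respect to (x, t) is (s, -1).
    """
    s = greedy_subgradient(f, x_star)
    # With f(empty set) = 0, the greedy subgradient satisfies s . x* = f_hat(x*).
    f_hat_star = sum(si * xi for si, xi in zip(s, x_star))

    def l_k(x: Sequence[float], t: float) -> float:
        linear = sum(si * (xi - xs) for si, xi, xs in zip(s, x, x_star))
        return linear - (t - t_star) + (f_hat_star - t_star)

    return l_k


if __name__ == "__main__":
    f = lambda A: len(A) ** 0.5
    cut = make_cut(f, [1.0, 0.0, 1.0], 0.5)
    print(cut([1.0, 0.0, 1.0], 0.5))      # positive: z_k is cut off
    print(cut([1.0, 1.0, 0.0], 2 ** 0.5)) # non-positive: feasible point kept
```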

4.5. Deletion rules. At each iteration of the algorithm, we try to delete
certain subprisms that contain no optimal solution. To this end, we adopt the
following two deletion rules:

(DR1): Delete $T_k$ if BILP (5) has no feasible solution.

(DR2): Delete $T_k$ if the optimal value $c^*$ of BILP (5) satisfies
$c^* \le 0$.




Figure 2. Training errors, test errors and computational time versus $\lambda$
for the prismatic algorithm and the supermodular-submodular procedure.



  p    n    k    exact (PRISM)     SSP              greedy           lasso
 120  150    5   1.8e-4 (192.6)    1.9e-4 (0.93)    1.8e-4 (0.45)    1.9e-4 (0.78)
 120  150   10   2.0e-4 (262.7)    2.4e-4 (0.81)    2.3e-4 (0.56)    2.4e-4 (0.84)
 120  150   20   7.3e-4 (339.2)    7.8e-4 (1.43)    8.3e-4 (0.59)    7.7e-4 (0.91)
 120  150   40   1.7e-3 (467.6)    2.1e-3 (1.17)    2.9e-3 (0.63)    1.9e-3 (0.87)

Table 1. Normalized mean-square prediction errors of training and test data by
the prismatic algorithm, the supermodular-submodular procedure, the greedy
algorithm and the lasso.



The validity of these rules can be seen from Proposition 1, as in the D.C.
programming problem [11]. That is, (DR1) follows from Proposition 1 (a), since
in this case $T \cap \tilde{D} = \emptyset$, i.e., the prism $T$ is infeasible,
and (DR2) follows from Proposition 1 (b1) and from the definition of $\mu$,
since the current best feasible solution cannot be improved in $T$.

5. Experimental Results 

We first provide illustrations of the proposed algorithm and its solution on toy
examples from feature selection in Section 5.1, and then apply the algorithm to
an application of discriminative structure learning using the UCI repository
data in Section 5.2. The experiments below were run on a 2.8 GHz 64-bit
workstation using Matlab and IBM ILOG CPLEX ver. 12.1.

5.1. Application to feature selection. We compared the performance and
solutions of the proposed prismatic algorithm (PRISM), the
supermodular-submodular procedure (SSP) [21], the greedy method and the LASSO.
To this end, we generated data as follows: Given $p$, $n$ and $k$, the design
matrix $X \in \mathbb{R}^{n \times p}$ is a matrix of i.i.d. Gaussian
components. A feature set $J$ of cardinality $k$ is chosen at random and the
weights on the selected features are sampled from a standard multivariate
Gaussian distribution. The weights on other features are 0. We then take
$y = Xw + n^{-1/2} \|Xw\|_2 \epsilon$, where $w$ is the vector of weights on
features and $\epsilon$ is a standard Gaussian vector. In the experiment, we
used the trace norm of the submatrix corresponding to $J$, $X_J$, i.e.,
$\mathrm{tr}(X_J^T X_J)^{1/2}$. Thus, our problem is
$\min_{w \in \mathbb{R}^p} \|y - Xw\|_2^2 + \lambda \cdot \mathrm{tr}(X_J^T X_J)^{1/2}$,
where $J$ is the support of $w$. Or equivalently,
$\min_{A \subseteq V} g(A) + \lambda \cdot \mathrm{tr}(X_A^T X_A)^{1/2}$, where
$g(A) := \min_{w_A \in \mathbb{R}^{|A|}} \|y - X_A w_A\|^2$.

Since the first term is a supermodular function [1] and the second is a
submodular function, this problem is the D.S. programming problem.
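
The data-generating process described above can be sketched in plain Python as
follows. The function name `generate_data`, the seed argument and the use of the
standard `random` module are assumptions of this example (the experiments in the
paper were run in Matlab), and the sketch covers only the synthetic data, not
the optimization.

```python
import math
import random
from typing import List, Tuple


def generate_data(p: int, n: int, k: int, seed: int = 0
                  ) -> Tuple[List[List[float]], List[float], List[float], List[int]]:
    """Draw (X, w, y, support) following the description in Section 5.1."""
    rng = random.Random(seed)
    # Design matrix with i.i.d. standard Gaussian entries (n rows, p columns).
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    # Random support J of cardinality k with standard Gaussian weights.
    support = sorted(rng.sample(range(p), k))
    w = [0.0] * p
    for j in support:
        w[j] = rng.gauss(0.0, 1.0)
    Xw = [sum(row[j] * w[j] for j in support) for row in X]
    # Noise scaled by n^{-1/2} * ||Xw||_2.
    scale = math.sqrt(sum(v * v for v in Xw)) / math.sqrt(n)
    y = [v + scale * rng.gauss(0.0, 1.0) for v in Xw]
    return X, w, y, support


if __name__ == "__main__":
    X, w, y, support = generate_data(p=120, n=150, k=10)
    print(len(X), len(X[0]), len(y), support)
```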

First, the graphs in Figure 2 show the training errors, test errors and
computational time versus $\lambda$ for PRISM and SSP (for $p = 120$, $n = 150$
and $k = 10$). The values in the graphs are averaged over 20 datasets. For the
test errors, we generated another 100 data points from the same model and
applied the estimated model to them. For all methods, we tried several possible
regularization parameters. From the graphs, we can see the following. First,
exact solutions (by PRISM) always outperform approximate ones (by SSP). This
suggests the significance of optimizing the submodular norm: we could obtain
better solutions (in the sense of prediction error) by optimizing the objective
with the submodular norm more exactly. Second, our algorithm took longer,
especially when $\lambda$ was smaller. This is presumably because a smaller
$\lambda$ generally gives a larger subset as the solution. Also, Table 1 shows
normalized mean-square prediction errors by the prismatic algorithm, the
supermodular-submodular procedure, the greedy method and the lasso for several
$k$. The values are averaged over 10 datasets. This result also suggests that
optimizing the objective with the submodular norm exactly matters in terms of
prediction errors.

5.2. Application to discriminative structure learning. Our second application
is discriminative structure learning using the UCI machine learning
repository^1. Here, we used CHESS, GERMAN, CENSUS-INCOME (KDD) and HEPATITIS,
which have two classes. The Bayesian network topology used was the tree
augmented naive Bayes (TAN) [35]. We estimated TANs from data in both
generative and discriminative manners. To this end, we used the procedure
described in [20] with a submodular minimization solver (for the generative
case), and the one in [21] combined with our prismatic algorithm (PRISM) or the
supermodular-submodular procedure (SSP) (for the discriminative case). Once the
structures had been estimated, the parameters were learned based on the maximum
likelihood method.

Table 2 shows the empirical accuracy of the classifiers in [%] with standard
deviation for these datasets. We used the train/test scheme described in
[6, 22]. Also, we removed instances with missing values. The results suggest
that optimizing the EAR measure more exactly could improve the classification
performance (which would mean that the EAR is a meaningful measure for
discriminative structure learning in the sense of classification).



6. Conclusions 

In this paper, we proposed a prismatic algorithm for the D.S. programming
problem (1), which is the first exact algorithm for this problem and is a
branch-and-bound method that exploits the structure of this problem. We
developed the algorithm based on the analogy with the D.C. programming problem
through the continuous relaxation of solution spaces and objective functions
with the help of the Lovász extension. We applied the proposed algorithm to
several situations of feature selection and discriminative structure learning
using artificial and real-world datasets.



^1 http://archive.ics.uci.edu/ml/index.html



 Data            Attr.   Class   exact (PRISM)   approx. (SSP)   generative
 Chess            36       2     96.6 (±0.69)    94.4 (±0.71)    92.3 (±0.79)
 German           20       2     70.0 (±0.43)    69.9 (±0.43)    69.1 (±0.49)
 Census-income    40       2     73.2 (±0.64)    71.2 (±0.74)    70.3 (±0.74)
 Hepatitis        19       2     86.9 (±1.89)    84.3 (±2.31)    84.2 (±2.11)

Table 2. Empirical accuracy of the classifiers in [%] with standard deviation
by the TANs discriminatively learned with PRISM or SSP and generatively learned
with a submodular minimization solver. The numbers in parentheses are
computational time in seconds.



The D.S. programming problem addressed in this paper covers a broad range of
applications in machine learning. In future work, we will develop variants of
the presented framework specialized to the specific structure of each problem.
Also, it would be interesting to investigate the extension of our method to
enumerate solutions, which could make the framework more useful in practice.